-
Notifications
You must be signed in to change notification settings - Fork 2
fix: ensure genes_gnomad_convert_v4
picks mane_select
and canonical
transcripts
#101
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
WalkthroughThe changes update environment configurations and data processing for gnomAD gene constraints. A new polars dependency is introduced in the environment file, while a previously maintained environment file for polars is removed. In the gnomAD rules, an inline Python function is removed and replaced with a script call to handle data conversion. The referenced script is enhanced with additional operations for DataFrame processing, such as casting non-string columns to strings and filling null values. Changes
Sequence Diagram(s)sequenceDiagram
participant SNK as Snakemake Rule
participant PY as Python Script
participant DF as DataFrame
SNK->>PY: Execute "gnomad_constraints_v4_to_tsv.py"
PY->>DF: Join with ensembl_xlink & perform transformations
DF-->>PY: Return processed data (cast columns, fill nulls)
PY->>SNK: Output TSV file
Poem
Warning There were issues while running some tools. Please review the errors and either fix the tool’s configuration or disable the tool if it’s a critical failure. 🔧 YAMLlint (1.35.1)rules/work/genes/envs/polars.yaml[Errno 2] No such file or directory: 'rules/work/genes/envs/polars.yaml' ✨ Finishing Touches
🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
Please check for formatting and linting issues. You can use
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (4)
rules/work/genes/scripts/gnomad_constraints_v4_to_tsv.py (4)
49-52
: Use idiomatic Python boolean comparisonsAvoid explicit comparisons to boolean literals.
-discard = df.filter(pl.col("mane_or_canonical") == False).select( +discard = df.filter(~pl.col("mane_or_canonical")).select( ["gene", "gene_id"] )🧰 Tools
🪛 Ruff (0.8.2)
49-49: Avoid equality comparisons to
False
; useif not pl.col("mane_or_canonical"):
for false checksReplace with
not pl.col("mane_or_canonical")
(E712)
52-52
: Consider using a logging mechanism instead of printFor better integration with the Snakemake workflow, consider using a proper logging mechanism instead of print statements.
-print("No mane_select or canonical transcript found for ", discard.select("gene")) +import logging +logging.info(f"No mane_select or canonical transcript found for {discard.select('gene')}")
54-55
: Use idiomatic Python boolean comparisonsAvoid explicit comparisons to boolean literals.
-df = ( - df.filter(pl.col("mane_or_canonical") == True) +df = ( + df.filter(pl.col("mane_or_canonical"))🧰 Tools
🪛 Ruff (0.8.2)
54-54: Avoid equality comparisons to
True
; useif pl.col("mane_or_canonical"):
for truth checksReplace with
pl.col("mane_or_canonical")
(E712)
67-71
: Handle undefinedsnakemake
variable for better IDE integrationThe
snakemake
variable is undefined in this file and only available when the script is run through Snakemake. Consider adding a conditional import or check to make the script more IDE-friendly and to address static analysis warnings.if __name__ == "__main__": + try: + # When run by snakemake main( snakemake.input.tsv, snakemake.input.xlink_ensembl, snakemake.output.tsv, snakemake.output.tsv_md5, ) + except NameError: + # For development/testing outside of snakemake + import argparse + parser = argparse.ArgumentParser() + parser.add_argument("--constraint-tsv", required=True) + parser.add_argument("--ensembl-xlink-tsv", required=True) + parser.add_argument("--output-tsv", required=True) + parser.add_argument("--output-tsv-md5", required=True) + args = parser.parse_args() + main( + args.constraint_tsv, + args.ensembl_xlink_tsv, + args.output_tsv, + args.output_tsv_md5 + )🧰 Tools
🪛 Ruff (0.8.2)
67-67: Undefined name
snakemake
(F821)
68-68: Undefined name
snakemake
(F821)
69-69: Undefined name
snakemake
(F821)
70-70: Undefined name
snakemake
(F821)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
environment.yml
(1 hunks)rules/work/genes/gnomad.smk
(1 hunks)rules/work/genes/scripts/gnomad_constraints_v4_to_tsv.py
(1 hunks)
🧰 Additional context used
🪛 Ruff (0.8.2)
rules/work/genes/scripts/gnomad_constraints_v4_to_tsv.py
49-49: Avoid equality comparisons to False
; use if not pl.col("mane_or_canonical"):
for false checks
Replace with not pl.col("mane_or_canonical")
(E712)
54-54: Avoid equality comparisons to True
; use if pl.col("mane_or_canonical"):
for truth checks
Replace with pl.col("mane_or_canonical")
(E712)
62-62: Undefined name shell
(F821)
67-67: Undefined name snakemake
(F821)
68-68: Undefined name snakemake
(F821)
69-69: Undefined name snakemake
(F821)
70-70: Undefined name snakemake
(F821)
🔇 Additional comments (3)
environment.yml (1)
56-57
: LGTM - appropriate dependency addedThe addition of Polars with version constraints (>=1.23,<2) is correctly formatted and includes a clear explanatory comment detailing its purpose.
rules/work/genes/gnomad.smk (1)
109-110
: Good refactoring approachMoving the complex processing logic from an inline function to a dedicated script improves maintainability and readability, which aligns well with the PR objectives.
rules/work/genes/scripts/gnomad_constraints_v4_to_tsv.py (1)
41-48
:❓ Verification inconclusive
Verify sorting logic for transcript selection
The current sorting and grouping logic selects the last record for each gene after sorting. Ensure that this approach correctly prioritizes MANE select and canonical transcripts as intended. Consider whether the sort order
["gene", "mane_select", "canonical", ...]
gives the right precedence.
🏁 Script executed:
#!/bin/bash # Check the data structure and values for mane_select and canonical fields head -n 10 $(find work/download/genes/gnomad -name "*.constraint_metrics.tsv" | head -n 1) | grep -E "mane_select|canonical" -A 2Length of output: 194
Action Required: Manually Verify Sorting Priority for Transcript Selection
- The current logic sorts records by
["gene", "mane_select", "canonical", "cds_length", "transcript"]
and then selects the last record per gene. This approach should, in theory, pick transcripts wheremane_select
orcanonical
are flagged (if booleans sort as expected).- The automated check did not yield output because the expected directory (
work/download/genes/gnomad
) was not found. Please verify that the file paths are correct or run a manual check to ensure that themane_select
andcanonical
fields contain the expected values and that the sorting indeed prioritizes the desired transcripts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (5)
rules/work/genes/scripts/gnomad_constraints_v4_to_tsv.py (5)
36-39
: Consider adding error handling for file operationsThe file reading operations don't have error handling. If one of the input files doesn't exist or has an unexpected format, the script will crash without a helpful error message.
-df = pl.read_csv(constraint_tsv_path, separator="\t", null_values="NA") -ensembl_xlink = pl.read_csv( - ensembl_xlink_tsv_path, separator="\t", null_values="NA" -) +try: + df = pl.read_csv(constraint_tsv_path, separator="\t", null_values="NA") + ensembl_xlink = pl.read_csv( + ensembl_xlink_tsv_path, separator="\t", null_values="NA" + ) +except Exception as e: + print(f"Error reading input files: {e}") + raise
49-51
: Use Pythonic boolean comparisonUsing
== False
is less Pythonic than using~
(not) operator ornot
with Polars expressions.-discard = df.filter(pl.col("mane_or_canonical") == False).select( +discard = df.filter(~pl.col("mane_or_canonical")).select( ["gene", "gene_id"] )🧰 Tools
🪛 Ruff (0.8.2)
49-49: Avoid equality comparisons to
False
; useif not pl.col("mane_or_canonical"):
for false checksReplace with
not pl.col("mane_or_canonical")
(E712)
53-55
: Use Pythonic boolean comparisonSimilar to the previous comment, use the direct boolean expression instead of
== True
.-df = ( - df.filter(pl.col("mane_or_canonical") == True) +df = ( + df.filter(pl.col("mane_or_canonical"))🧰 Tools
🪛 Ruff (0.8.2)
54-54: Avoid equality comparisons to
True
; useif pl.col("mane_or_canonical"):
for truth checksReplace with
pl.col("mane_or_canonical")
(E712)
65-71
: Add error handling and documentation for Snakemake dependencyThe script assumes the
snakemake
object is available, which is true when run through Snakemake but might cause issues if someone tries to run the script directly. Consider adding a comment or try-except block to handle this dependency.if __name__ == "__main__": + # This script is designed to be run by Snakemake + # The snakemake object is automatically available in the namespace + try: main( snakemake.input.tsv, snakemake.input.xlink_ensembl, snakemake.output.tsv, snakemake.output.tsv_md5, ) + except NameError as e: + print("This script is designed to be run by Snakemake") + print(f"Error: {e}") + import sys + sys.exit(1)🧰 Tools
🪛 Ruff (0.8.2)
67-67: Undefined name
snakemake
(F821)
68-68: Undefined name
snakemake
(F821)
69-69: Undefined name
snakemake
(F821)
70-70: Undefined name
snakemake
(F821)
3-72
: Add docstrings for better documentationThe code would benefit from docstrings to explain the purpose of the script and the
main
function, including its parameters and what it does.+""" +Script to convert gnomAD constraints data to TSV format. + +This script processes gnomAD constraint data, prioritizing mane_select and +canonical transcripts, and converts it to a standardized TSV format. +""" import polars as pl def main( constraint_tsv_path, ensembl_xlink_tsv_path, output_tsv_path, output_tsv_md5_path ): + """ + Process gnomAD constraint data and convert it to TSV format. + + Args: + constraint_tsv_path: Path to the input constraint TSV file + ensembl_xlink_tsv_path: Path to the Ensembl cross-reference TSV file + output_tsv_path: Path to write the output TSV file + output_tsv_md5_path: Path to write the MD5 checksum of the output TSV file + + The function prioritizes mane_select and canonical transcripts when + selecting entries for each gene. + """🧰 Tools
🪛 Ruff (0.8.2)
49-49: Avoid equality comparisons to
False
; useif not pl.col("mane_or_canonical"):
for false checksReplace with
not pl.col("mane_or_canonical")
(E712)
54-54: Avoid equality comparisons to
True
; useif pl.col("mane_or_canonical"):
for truth checksReplace with
pl.col("mane_or_canonical")
(E712)
62-62: Undefined name
snakemake
(F821)
67-67: Undefined name
snakemake
(F821)
68-68: Undefined name
snakemake
(F821)
69-69: Undefined name
snakemake
(F821)
70-70: Undefined name
snakemake
(F821)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
environment.yml
(1 hunks)rules/work/genes/scripts/gnomad_constraints_v4_to_tsv.py
(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- environment.yml
🧰 Additional context used
🧠 Learnings (1)
rules/work/genes/scripts/gnomad_constraints_v4_to_tsv.py (1)
Learnt from: tedil
PR: varfish-org/varfish-db-downloader#101
File: rules/work/genes/scripts/gnomad_constraints_v4_to_tsv.py:62-62
Timestamp: 2025-03-10T08:16:40.589Z
Learning: When writing Python scripts for Snakemake (used with the `script:` directive), the `snakemake` object is automatically available in the script's namespace. You can access Snakemake's shell function using `snakemake.shell()` rather than importing subprocess.
🪛 Ruff (0.8.2)
rules/work/genes/scripts/gnomad_constraints_v4_to_tsv.py
49-49: Avoid equality comparisons to False
; use if not pl.col("mane_or_canonical"):
for false checks
Replace with not pl.col("mane_or_canonical")
(E712)
54-54: Avoid equality comparisons to True
; use if pl.col("mane_or_canonical"):
for truth checks
Replace with pl.col("mane_or_canonical")
(E712)
62-62: Undefined name snakemake
(F821)
67-67: Undefined name snakemake
(F821)
68-68: Undefined name snakemake
(F821)
69-69: Undefined name snakemake
(F821)
70-70: Undefined name snakemake
(F821)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: Run-Test-Mode
🔇 Additional comments (6)
rules/work/genes/scripts/gnomad_constraints_v4_to_tsv.py (6)
1-5
: Good use of Polars for data processingUsing Polars is an excellent choice for handling dataframe operations, providing better performance and more modern API than pandas. The function signature clearly defines the required inputs and outputs.
6-26
: LGTM: Well-structured column mappingThe column mapping dictionary is clearly defined and easy to maintain if column names change in the future.
30-34
: Elegant approach for constructing output columnsThe approach of concatenating lists to construct the final columns list is clean and maintainable.
41-48
: Good filtering logic for mane_select and canonical transcriptsThe implementation correctly addresses the PR objective by prioritizing transcripts based on mane_select and canonical flags, which improves reliability over sort-order dependency.
56-60
: LGTM: Good data transformation pipelineThe data transformation pipeline is well-structured, performing column additions, joins, and null filling in a clear sequence.
62-62
: Correct usage of Snakemake shell functionBased on the retrieved learning, this is the correct way to use Snakemake's shell function for executing shell commands in a Snakemake script.
🧰 Tools
🪛 Ruff (0.8.2)
62-62: Undefined name
snakemake
(F821)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
rules/work/genes/scripts/gnomad_constraints_v4_to_tsv.py (1)
50-56
: Use Pythonic boolean comparisonsCurrent boolean comparisons use explicit equality checks with True/False literals, which is not idiomatic Python.
- discard = df.filter(pl.col("mane_or_canonical") == False).select( + discard = df.filter(~pl.col("mane_or_canonical")).select( ["gene", "gene_id"] ) print("No mane_select or canonical transcript found for ", discard.select("gene")) df = ( - df.filter(pl.col("mane_or_canonical") == True) + df.filter(pl.col("mane_or_canonical"))🧰 Tools
🪛 Ruff (0.8.2)
50-50: Avoid equality comparisons to
False
; useif not pl.col("mane_or_canonical"):
for false checksReplace with
not pl.col("mane_or_canonical")
(E712)
55-55: Avoid equality comparisons to
True
; useif pl.col("mane_or_canonical"):
for truth checksReplace with
pl.col("mane_or_canonical")
(E712)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
rules/work/genes/scripts/gnomad_constraints_v4_to_tsv.py
(1 hunks)
🧰 Additional context used
🧠 Learnings (1)
rules/work/genes/scripts/gnomad_constraints_v4_to_tsv.py (1)
Learnt from: tedil
PR: varfish-org/varfish-db-downloader#101
File: rules/work/genes/scripts/gnomad_constraints_v4_to_tsv.py:62-62
Timestamp: 2025-03-10T08:16:40.589Z
Learning: When writing Python scripts for Snakemake (used with the `script:` directive), the `snakemake` object is automatically available in the script's namespace. You can access Snakemake's shell function using `snakemake.shell()` rather than importing subprocess.
🪛 Ruff (0.8.2)
rules/work/genes/scripts/gnomad_constraints_v4_to_tsv.py
50-50: Avoid equality comparisons to False
; use if not pl.col("mane_or_canonical"):
for false checks
Replace with not pl.col("mane_or_canonical")
(E712)
55-55: Avoid equality comparisons to True
; use if pl.col("mane_or_canonical"):
for truth checks
Replace with pl.col("mane_or_canonical")
(E712)
68-68: Undefined name snakemake
(F821)
69-69: Undefined name snakemake
(F821)
70-70: Undefined name snakemake
(F821)
71-71: Undefined name snakemake
(F821)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: Run-Test-Mode
🔇 Additional comments (6)
rules/work/genes/scripts/gnomad_constraints_v4_to_tsv.py (6)
1-3
: Good use of imports and dependenciesYou've correctly imported the Polars library for data manipulation and the shell function from snakemake.shell based on the previous feedback. This is the right approach for Snakemake scripts.
4-35
: Well-structured data mapping and column definitionsThe code defines a clear mapping between source and destination column names, as well as additional columns needed. The organization of this section makes the code readable and maintainable.
37-40
: Effective data loading with PolarsThe use of Polars for reading TSV files with appropriate parameters (separator and null_values) is a good choice. Polars provides better performance than pandas for these types of operations.
42-53
: Clear data transformation pipelineThe data transformation pipeline is well-constructed with appropriate sorting, filtering, and grouping operations to ensure the correct transcript selection.
🧰 Tools
🪛 Ruff (0.8.2)
50-50: Avoid equality comparisons to
False
; useif not pl.col("mane_or_canonical"):
for false checksReplace with
not pl.col("mane_or_canonical")
(E712)
62-63
: Proper MD5 checksum generation using Snakemake shellYou've properly used the Snakemake shell function for generating the MD5 checksum. This is the correct approach based on the previous feedback and retrieved learning that the snakemake object is automatically available in the script's namespace.
66-72
: Proper use of Snakemake integrationThe script correctly uses the snakemake object to access inputs and outputs when executed by Snakemake. While static analysis tools may flag "undefined name
snakemake
", this is expected behavior as the snakemake object is injected into the script's namespace by Snakemake itself.🧰 Tools
🪛 Ruff (0.8.2)
68-68: Undefined name
snakemake
(F821)
69-69: Undefined name
snakemake
(F821)
70-70: Undefined name
snakemake
(F821)
71-71: Undefined name
snakemake
(F821)
closes #103 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (8)
rules/work/genes/scripts/gnomad_constraints_v4_to_tsv.py (8)
1-3
: Consider adding a docstring to explain script's purpose.The script lacks a module-level docstring that would help users understand its purpose, inputs, and expected outputs. Since this script is replacing inline functionality from a Snakemake file, clear documentation would be valuable.
import polars as pl from snakemake.shell import shell +""" +Process gnomAD v4 constraint data prioritizing MANE select and canonical transcripts. + +This script converts gnomAD v4 constraint data to the VarFish TSV format, ensuring +that MANE select and canonical transcripts are prioritized for each gene. +""" +
4-6
: Add a function docstring with parameter descriptions.Adding a proper docstring to the main function would improve code readability and maintenance.
def main( constraint_tsv_path, ensembl_xlink_tsv_path, output_tsv_path, output_tsv_md5_path ): + """ + Convert gnomAD v4 constraint data to VarFish TSV format. + + Args: + constraint_tsv_path: Path to the input gnomAD constraint TSV file + ensembl_xlink_tsv_path: Path to the Ensembl cross-reference TSV file + output_tsv_path: Path where the output TSV file will be written + output_tsv_md5_path: Path where the MD5 checksum of the output file will be written + """
42-49
: Add a comment explaining the transcript selection logic.The sorting and filtering logic is complex and crucial to understanding how the preferred transcript is selected for each gene. A comment would clarify the intent.
df = ( + # Sort to prioritize MANE select and canonical transcripts, then by CDS length + # Group by gene and take the last record (highest priority transcript per gene) df.sort(by=["gene", "mane_select", "canonical", "cds_length", "transcript"]) .filter(pl.col("gene_id").str.starts_with("ENSG")) .group_by(["gene"]) .last() .sort(["gene", "mane_select", "canonical"]) .with_columns(mane_or_canonical=pl.col("canonical") | pl.col("mane_select")) )
50-53
: Useis_not()
for boolean column comparison.Instead of comparing a boolean column to
False
, use the more idiomatic Polars expression withis_not()
.- discard = df.filter(pl.col("mane_or_canonical") == False).select( + discard = df.filter(pl.col("mane_or_canonical").is_not()).select( ["gene", "gene_id"] ) print("No mane_select or canonical transcript found for ", discard.select("gene"))🧰 Tools
🪛 Ruff (0.8.2)
50-50: Avoid equality comparisons to
False
; useif not pl.col("mane_or_canonical"):
for false checksReplace with
not pl.col("mane_or_canonical")
(E712)
53-53
: Consider using logging instead of print.Using Python's logging module would provide better control over output verbosity and destination, which is especially important in a Snakemake workflow.
+import logging + +# At the top of the file +logging.basicConfig(level=logging.INFO) + +# Replace the print statement with - print("No mane_select or canonical transcript found for ", discard.select("gene")) + logging.info(f"No mane_select or canonical transcript found for {discard.select('gene')}")
55-55
: Useis_()
for boolean column comparison.Instead of comparing a boolean column to
True
, use the more idiomatic Polars expression withis_()
.df = ( - df.filter(pl.col("mane_or_canonical") == True) + df.filter(pl.col("mane_or_canonical").is_()) .rename(columns_src_dst) .with_columns(pl.lit(None).alias(col) for col in additional_columns) .join(ensembl_xlink, on=["ensembl_transcript_id"]) .select(columns_dst) )🧰 Tools
🪛 Ruff (0.8.2)
55-55: Avoid equality comparisons to
True
; useif pl.col("mane_or_canonical"):
for truth checksReplace with
pl.col("mane_or_canonical")
(E712)
57-57
: Consider using list comprehension for clarity.The current code uses generator expression inside
with_columns()
. For better readability, consider using a list comprehension.- .with_columns(pl.lit(None).alias(col) for col in additional_columns) + .with_columns([pl.lit(None).alias(col) for col in additional_columns])
1-72
: Add error handling for robust execution.The script lacks error handling which could be helpful for diagnosing issues in the Snakemake workflow. Consider adding try-except blocks around critical operations.
def main( constraint_tsv_path, ensembl_xlink_tsv_path, output_tsv_path, output_tsv_md5_path ): + try: columns_src_dst = { # ... existing code ... } # ... existing code ... df.write_csv(output_tsv_path, separator="\t") shell(f"md5sum {output_tsv_path} > {output_tsv_md5_path}") + except Exception as e: + logging.error(f"Error processing gnomAD constraints: {e}") + raise🧰 Tools
🪛 Ruff (0.8.2)
50-50: Avoid equality comparisons to
False
; useif not pl.col("mane_or_canonical"):
for false checksReplace with
not pl.col("mane_or_canonical")
(E712)
55-55: Avoid equality comparisons to
True
; useif pl.col("mane_or_canonical"):
for truth checksReplace with
pl.col("mane_or_canonical")
(E712)
67-67: Undefined name
snakemake
(F821)
68-68: Undefined name
snakemake
(F821)
69-69: Undefined name
snakemake
(F821)
70-70: Undefined name
snakemake
(F821)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
rules/work/genes/scripts/gnomad_constraints_v4_to_tsv.py
(1 hunks)
🧰 Additional context used
🧠 Learnings (1)
rules/work/genes/scripts/gnomad_constraints_v4_to_tsv.py (1)
Learnt from: tedil
PR: varfish-org/varfish-db-downloader#101
File: rules/work/genes/scripts/gnomad_constraints_v4_to_tsv.py:62-62
Timestamp: 2025-03-10T08:16:40.589Z
Learning: When writing Python scripts for Snakemake (used with the `script:` directive), the `snakemake` object is automatically available in the script's namespace. You can access Snakemake's shell function using `snakemake.shell()` rather than importing subprocess.
🪛 Ruff (0.8.2)
rules/work/genes/scripts/gnomad_constraints_v4_to_tsv.py
50-50: Avoid equality comparisons to False
; use if not pl.col("mane_or_canonical"):
for false checks
Replace with not pl.col("mane_or_canonical")
(E712)
55-55: Avoid equality comparisons to True
; use if pl.col("mane_or_canonical"):
for truth checks
Replace with pl.col("mane_or_canonical")
(E712)
67-67: Undefined name snakemake
(F821)
68-68: Undefined name snakemake
(F821)
69-69: Undefined name snakemake
(F821)
70-70: Undefined name snakemake
(F821)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: Run-Test-Mode
🔇 Additional comments (2)
rules/work/genes/scripts/gnomad_constraints_v4_to_tsv.py (2)
62-62
: Correct use of Snakemake shell function.Great job using the
shell
function fromsnakemake.shell
for generating the MD5 checksum. This is the recommended approach for Snakemake scripts.
65-71
: Appropriate handling of Snakemake integration.The script correctly accesses the
snakemake
object which is automatically available in the namespace when run through Snakemake'sscript:
directive. This is the idiomatic way to write Snakemake scripts.🧰 Tools
🪛 Ruff (0.8.2)
67-67: Undefined name
snakemake
(F821)
68-68: Undefined name
snakemake
(F821)
69-69: Undefined name
snakemake
(F821)
70-70: Undefined name
snakemake
(F821)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (5)
rules/work/genes/scripts/gnomad_constraints_v4_to_tsv.py (5)
33-33
: Consider adding a comment explaining the slice operationThe code skips the first value from
columns_src_dst.values()
(which is "ensembl_transcript_id"). This is intentional since it's included separately in the hardcoded part of the list, but this might not be immediately obvious to other developers.- + list(columns_src_dst.values())[1:] + # Skip first value "ensembl_transcript_id" as it's already included via the join operation + + list(columns_src_dst.values())[1:]
50-52
: Avoid direct Boolean comparisons in Polars expressionsAccording to Polars best practices, it's better to avoid equality comparisons to boolean literals.
- discard = df.filter(pl.col("mane_or_canonical") == False).select( + discard = df.filter(~pl.col("mane_or_canonical")).select(🧰 Tools
🪛 Ruff (0.8.2)
50-50: Avoid equality comparisons to
False
; useif not pl.col("mane_or_canonical"):
for false checksReplace with
not pl.col("mane_or_canonical")
(E712)
53-53
: Consider using proper logging instead of printFor better integration with Snakemake's logging system, consider using the Python logging module instead of print statements.
- print("No mane_select or canonical transcript found for ", discard.select("gene")) + import logging + logging.info(f"No mane_select or canonical transcript found for {discard.select('gene')}")
55-55
: Avoid direct Boolean comparisons in Polars expressionsSimilar to the previous comment, replace the explicit boolean comparison with a more idiomatic expression.
- df.filter(pl.col("mane_or_canonical") == True) + df.filter(pl.col("mane_or_canonical"))🧰 Tools
🪛 Ruff (0.8.2)
55-55: Avoid equality comparisons to
True
; useif pl.col("mane_or_canonical"):
for truth checksReplace with
pl.col("mane_or_canonical")
(E712)
67-73
: Add a comment explaining thesnakemake
objectThe
snakemake
object is automatically available when the script is run through Snakemake'sscript:
directive, but this might not be obvious to all developers.if __name__ == "__main__": + # The snakemake object is automatically available when this script is run through + # Snakemake's script: directive main( snakemake.input.tsv, snakemake.input.xlink_ensembl, snakemake.output.tsv, snakemake.output.tsv_md5, )🧰 Tools
🪛 Ruff (0.8.2)
69-69: Undefined name
snakemake
(F821)
70-70: Undefined name
snakemake
(F821)
71-71: Undefined name
snakemake
(F821)
72-72: Undefined name
snakemake
(F821)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
rules/work/genes/scripts/gnomad_constraints_v4_to_tsv.py
(1 hunks)
🧰 Additional context used
🧠 Learnings (1)
rules/work/genes/scripts/gnomad_constraints_v4_to_tsv.py (1)
Learnt from: tedil
PR: varfish-org/varfish-db-downloader#101
File: rules/work/genes/scripts/gnomad_constraints_v4_to_tsv.py:62-62
Timestamp: 2025-03-10T08:16:40.589Z
Learning: When writing Python scripts for Snakemake (used with the `script:` directive), the `snakemake` object is automatically available in the script's namespace. You can access Snakemake's shell function using `snakemake.shell()` rather than importing subprocess.
🪛 Ruff (0.8.2)
rules/work/genes/scripts/gnomad_constraints_v4_to_tsv.py
50-50: Avoid equality comparisons to False
; use if not pl.col("mane_or_canonical"):
for false checks
Replace with not pl.col("mane_or_canonical")
(E712)
55-55: Avoid equality comparisons to True
; use if pl.col("mane_or_canonical"):
for truth checks
Replace with pl.col("mane_or_canonical")
(E712)
69-69: Undefined name snakemake
(F821)
70-70: Undefined name snakemake
(F821)
71-71: Undefined name snakemake
(F821)
72-72: Undefined name snakemake
(F821)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: Run-Test-Mode
🔇 Additional comments (3)
rules/work/genes/scripts/gnomad_constraints_v4_to_tsv.py (3)
1-2
: Good fix: Properly importing shell from snakemake.shellThis correctly implements the suggestion from previous reviews to use the Snakemake shell function rather than relying on undefined shell functions or subprocess.
42-49
: Good implementation: Reliable transcript selection logicThe sorting and filtering logic ensures priority selection of transcripts with "mane_select" and "canonical" attributes, which directly addresses the PR objective of making the selection more reliable rather than depending on CLI tool sort order.
37-40
:❓ Verification inconclusive
Verify ensembl_xlink_tsv schema matches expectations
The code joins data using the "ensembl_transcript_id" column, but it's worth verifying that this column exists in the ensembl_xlink dataset with the expected values.
Also applies to: 58-58
🏁 Script executed:
#!/bin/bash # Check for the existence and format of the ensembl_transcript_id column in the xlink file head -n 5 $(find . -name "*.tsv" | grep -i ensembl | grep -i xlink) | cut -f 1-5Length of output: 84
Below is a new shell script designed to help verify the schema of the TSV file. Please run it and manually check that the header row contains the
"ensembl_transcript_id"
column with the expected format.Action Required: Verify ensembl_xlink Data Schema
- The script searches for TSV files whose names include both
"ensembl"
and"xlink"
.- For each file found, it prints the header line with each column on its own line.
- After execution, please ensure that one of the header entries is exactly
"ensembl_transcript_id"
and that its values (from subsequent lines) match the expected format.Once you’ve run the script, please verify that the
"ensembl_transcript_id"
column is present and correctly formatted.
🏁 Script executed:
#!/bin/bash # Find all TSV files containing both 'ensembl' and 'xlink' in their filename and display header columns files=$(find . -iname "*ensembl*xlink*.tsv") if [ -z "$files" ]; then echo "No TSV files matching 'ensembl*xlink*.tsv' were found." exit 1 fi for file in $files; do echo "File: $file" echo "Header columns (each on a new line):" head -n 1 "$file" | tr "\t" "\n" echo "----------------------" doneLength of output: 197
Manual Verification of ensembl_xlink Dataset Required
The automated searches did not locate any TSV file matching the expected pattern, which means we couldn’t confirm the presence of the
"ensembl_transcript_id"
column automatically. Please ensure manually that the file referenced by theensembl_xlink_tsv_path
variable:
- Exists in the repository: Verify that the expected file (with a name related to “ensembl” and “xlink”) is present and correctly referenced.
- Has the correct schema: Open the file and confirm that its header includes a column exactly named
"ensembl_transcript_id"
and that the column values meet the expected format.If the file is found under a different naming convention or location, please update the code accordingly. If it’s missing entirely, add it or adjust the logic that joins data on that column.
3a98e70
to
f8509f4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
rules/work/genes/scripts/gnomad_constraints_v4_to_tsv.py (1)
66-67
: Duplicateif __name__ == "__main__"
checkThere are two identical
if __name__ == "__main__"
checks at lines 66 and 67. The first one is followed by an empty line, which suggests it might be a mistake.-if __name__ == "__main__": - if __name__ == "__main__": main(
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
rules/work/genes/envs/polars.yaml
(1 hunks)rules/work/genes/gnomad.smk
(1 hunks)rules/work/genes/scripts/gnomad_constraints_v4_to_tsv.py
(1 hunks)environment.yml
(1 hunks)rules/work/genes/envs/polars.yaml
(0 hunks)rules/work/genes/gnomad.smk
(0 hunks)rules/work/genes/gnomad.smk
(0 hunks)rules/work/genes/scripts/gnomad_constraints_v4_to_tsv.py
(1 hunks)rules/work/genes/scripts/gnomad_constraints_v4_to_tsv.py
(2 hunks)rules/work/genes/scripts/gnomad_constraints_v4_to_tsv.py
(0 hunks)rules/work/genes/scripts/gnomad_constraints_v4_to_tsv.py
(1 hunks)
✅ Files skipped from review due to trivial changes (2)
- rules/work/genes/envs/polars.yaml
- rules/work/genes/envs/polars.yaml
🚧 Files skipped from review as they are similar to previous changes (6)
- rules/work/genes/scripts/gnomad_constraints_v4_to_tsv.py
- environment.yml
- rules/work/genes/scripts/gnomad_constraints_v4_to_tsv.py
- rules/work/genes/gnomad.smk
- rules/work/genes/gnomad.smk
- rules/work/genes/scripts/gnomad_constraints_v4_to_tsv.py
🧰 Additional context used
🧠 Learnings (1)
rules/work/genes/scripts/gnomad_constraints_v4_to_tsv.py (1)
Learnt from: tedil
PR: varfish-org/varfish-db-downloader#101
File: rules/work/genes/scripts/gnomad_constraints_v4_to_tsv.py:62-62
Timestamp: 2025-03-10T08:16:40.589Z
Learning: When writing Python scripts for Snakemake (used with the `script:` directive), the `snakemake` object is automatically available in the script's namespace. You can access Snakemake's shell function using `snakemake.shell()` rather than importing subprocess.
⏰ Context from checks skipped due to timeout of 90000ms (2)
- GitHub Check: Run-Test-Mode
- GitHub Check: Check-Remote-URLs
🔇 Additional comments (3)
rules/work/genes/scripts/gnomad_constraints_v4_to_tsv.py (3)
62-62
: Great fix using snakemake.shell!This change correctly addresses the previously identified issue with the undefined
shell
function. Usingsnakemake.shell()
is the proper approach when writing Python scripts for Snakemake'sscript:
directive, as thesnakemake
object is automatically available in the script's namespace.
42-56
: Good implementation for selecting preferred transcriptsThe sorting and filtering logic here implements the PR objective of ensuring
genes_gnomad_convert_v4
properly selectsmane_select
andcanonical
transcripts. By prioritizing these transcript types explicitly rather than relying on CLI sort order, you've made the code more reliable and readable.The diagnostic print statement for genes without preferred transcripts is also helpful for monitoring data quality.
🧰 Tools
🪛 Ruff (0.8.2)
50-50: Avoid equality comparisons to
False
; useif not pl.col("mane_or_canonical"):
for false checksReplace with
not pl.col("mane_or_canonical")
(E712)
55-55: Avoid equality comparisons to
True
; useif pl.col("mane_or_canonical"):
for truth checksReplace with
pl.col("mane_or_canonical")
(E712)
1-2
: Appropriate imports for the implementationThe imports look good - using Polars for the data manipulation, which aligns with the PR description of reimplementing using the Polars library, and importing the Snakemake shell function that's used in line 62.
The previous code relied on sort order and consistent cli tool behaviour; I suspect it may have been enough to simply remove the reverse flag from the "3,3r" sort parameter. However, to be on the safe side, we re-implemented the logic using polars. Should be easier to read and more stable now.
(This probably also needs to be done for
genes_gnomad_convert_v2_1_1
)Summary by CodeRabbit